Prediction of Supplemental Anesthesia

Summary

Machine learning techniques are applied to extract value from a pulp sensibility data set. Supervised learning algorithms are used to solve the classification problem of predicting whether a patient needs supplemental anesthesia, and to identify the features that drive that need.

Skills and Tools Used

Skills

  • Machine learning
    • Regression
    • Classification
    • Clustering
  • Exploratory data analysis
  • Data cleaning
  • Data visualization

Data Collection

Data Set Selection

We collected data for 128 patients with 20 features, including the binary target feature "Need Supliment", which records whether a patient needs supplemental anesthesia. (One duplicate record is dropped below, leaving 127 rows.)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [2]:
df = pd.read_csv("Pulp Sensibility.csv")
In [3]:
df.drop_duplicates(inplace=True)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 0 to 126
Data columns (total 20 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Patient                                   127 non-null    object
 1   Age                                       127 non-null    int64 
 2   Dental History                            127 non-null    int64 
 3   Medical History                           127 non-null    object
 4   Pain (VAS)                                127 non-null    int64 
 5   Pain ( Duration) days                     127 non-null    int64 
 6   Percussion                                127 non-null    int64 
 7   Palpation                                 127 non-null    int64 
 8   Mobility                                  127 non-null    int64 
 9   PDL involvement                           127 non-null    int64 
 10  Curved Canal                              127 non-null    int64 
 11  Pulp stone or and Calcification           127 non-null    int64 
 12  PDL space                                 127 non-null    int64 
 13  Lamina Dura                               127 non-null    object
 14  Cold test ( VAS) Before anaesthesia       127 non-null    int64 
 15  Cold test (Duration) Before anaesthesia   127 non-null    int64 
 16  EPT ( VAS) before anaesthesia             127 non-null    int64 
 17  EPT current pass                          127 non-null    int64 
 18  EPT (Duration) before anaesthesia         127 non-null    int64 
 19  Need Supliment                            127 non-null    int64 
dtypes: int64(17), object(3)
memory usage: 20.8+ KB

Data Preview

In [5]:
df.head()
Out[5]:
Patient Age Dental History Medical History Pain (VAS) Pain ( Duration) days Percussion Palpation Mobility PDL involvement Curved Canal Pulp stone or and Calcification PDL space Lamina Dura Cold test ( VAS) Before anaesthesia Cold test (Duration) Before anaesthesia EPT ( VAS) before anaesthesia EPT current pass EPT (Duration) before anaesthesia Need Supliment
0 F 37 0 0 6 30 2 0 0 1 1 0 1 LOSS 0 0 5 32 5 1
1 F 47 0 0 2 30 0 0 0 0 0 0 0 0 3 3 0 80 3 1
2 F 27 0 0 6 7 0 0 0 0 0 0 0 0 7 23 5 27 19 1
3 M 27 0 0 6 7 0 0 0 0 0 0 0 0 7 27 5 21 37 1
4 M 23 0 0 4 60 1 0 0 0 1 0 1 0 5 12 2 43 5 1

Exploratory Data Analysis

Data Profile Report

A data profile report is generated to explore the contents of the collected data set.

In [6]:
from pandas_profiling import ProfileReport
C:\Users\user\AppData\Local\Temp\ipykernel_34516\2274191625.py:1: DeprecationWarning: `import pandas_profiling` is going to be deprecated by April 1st. Please use `import ydata_profiling` instead.
  from pandas_profiling import ProfileReport
In [7]:
# raw_data = pd.read_csv("Pulp Sensibility.csv", index_col=False)
# df_profile = df.copy()
# Create and display a report summarizing the pulp sensibility data set
profile = ProfileReport(df,
                        title='Pulp Sensibility Data Profile Report',
                        html={'style': {
                            'full_width': True
                        }})
profile.to_notebook_iframe()

Bar Charts of Categorical Features

In [8]:
df_cat = df[['Patient', 'Dental History', 'Medical History', 
       'Percussion ', 'Palpation', 'Mobility',
       'PDL involvement', 'Curved Canal ', 'Pulp stone or and Calcification',
       'PDL space', 'Lamina Dura',
       'Need Supliment']]
In [9]:
df_cat.head()
Out[9]:
Patient Dental History Medical History Percussion Palpation Mobility PDL involvement Curved Canal Pulp stone or and Calcification PDL space Lamina Dura Need Supliment
0 F 0 0 2 0 0 1 1 0 1 LOSS 1
1 F 0 0 0 0 0 0 0 0 0 0 1
2 F 0 0 0 0 0 0 0 0 0 0 1
3 M 0 0 0 0 0 0 0 0 0 0 1
4 M 0 0 1 0 0 0 1 0 1 0 1
In [10]:
df_num = df.drop(df_cat.columns, axis=1)
In [11]:
df_num.head()
Out[11]:
Age Pain (VAS) Pain ( Duration) days Cold test ( VAS) Before anaesthesia Cold test (Duration) Before anaesthesia EPT ( VAS) before anaesthesia EPT current pass EPT (Duration) before anaesthesia
0 37 6 30 0 0 5 32 5
1 47 2 30 3 3 0 80 3
2 27 6 7 7 23 5 27 19
3 27 6 7 7 27 5 21 37
4 23 4 60 5 12 2 43 5
In [12]:
fig = px.bar(df['Need Supliment'].value_counts(),
             color=df['Need Supliment'].value_counts().index,
             text_auto=True,
             labels=dict(index="Requirement of Supplement",
                         value="Total Number of Patients"))
fig.show()

Observations:

  • Out of the 127 patients in our data set, 84 required supplemental anesthesia and 43 did not.
  • Although the data set is imbalanced, we can still draw some insights from it.
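One way to account for this imbalance at training time (not done in the notebook itself) would be class weighting. As a minimal sketch, the helper below reproduces the weight formula scikit-learn uses for `class_weight='balanced'`; the `labels` list is illustrative, mirroring the 84/43 split reported above, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights matching scikit-learn's class_weight='balanced'
    formula: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Illustrative labels mirroring the 84 / 43 split reported above
labels = [1] * 84 + [0] * 43
weights = balanced_class_weights(labels)
```

The resulting dictionary could be passed as the `class_weight` argument of `LogisticRegression`, giving the minority class (no supplement needed) proportionally more influence during fitting.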
In [13]:
def bar_chart(feature):
    suppliment_need = df[df['Need Supliment']==1][feature].value_counts()
    no_suppliment = df[df['Need Supliment']==0][feature].value_counts()
    df_view = pd.DataFrame([suppliment_need,no_suppliment])
    df_view.index = ['Supplement (Required)','Supplement (Not Required)']
    fig = px.bar(df_view, barmode='group', text_auto=True,
                 labels=dict(index="Requirement of Supplement",
                             value="Total Number of Patients"))
    fig.show()
In [14]:
bar_chart('Patient')

Observations:

  • Male patients required supplemental anesthesia more often than female patients.
In [15]:
bar_chart('Medical History')

Observations:

  • Although most patients have no previous medical history, a significant number of patients with a medical history such as "DM" (diabetes mellitus) required supplemental anesthesia.
In [34]:
bar_chart('Dental History')

Observations:

  • Even patients with no previous dental history still frequently required supplemental anesthesia.
In [16]:
bar_chart('Pulp stone or and Calcification')

Observations:

  • Supplemental anesthesia was required when calcified tissue was present at the level of the pulp chamber and the roots of the teeth.
In [17]:
fig = px.histogram(df,x='Age',text_auto=True)
fig.show()

Observations:

  • The age distribution of the data set is bimodal: the patients fall mainly into two groups, (1) ages 25-34 and (2) ages 55-59.
In [18]:
fig = px.bar(data_frame=df.groupby(['Need Supliment']).mean().reset_index(),x="Need Supliment",y="Age", text_auto=True)
fig.show()

Observations:

  • The graph above shows that, on average, older patients are more likely to need supplemental anesthesia.
In [19]:
fig = px.bar(data_frame=df.groupby(['Need Supliment']).mean().reset_index(),x="Need Supliment",y="Pain ( Duration) days", text_auto=True)
fig.show()

Observations:

  • Longer pain duration is strongly associated with the need for supplemental anesthesia; when the average pain duration is under 10 days, a supplement is usually not needed.
In [20]:
fig = px.bar(data_frame=df.groupby(['Need Supliment']).mean().reset_index(),x="Need Supliment",y='EPT (Duration) before anaesthesia ', text_auto=True)
fig.show()

Observations:

  • When the electric pulp testing (EPT) duration exceeds 6 minutes, supplemental anesthesia tends to be needed.

Classification Problem

Feature Engineering

In [22]:
df['Lamina Dura'] = df['Lamina Dura'].map({'0':0,'LOSS':1})
df['Patient'] = df['Patient'].map({'M':1,'F':0})
In [28]:
df.head()
Out[28]:
Patient Age Dental History Medical History Pain (VAS) Pain ( Duration) days Percussion Palpation Mobility PDL involvement Curved Canal Pulp stone or and Calcification PDL space Lamina Dura Cold test ( VAS) Before anaesthesia Cold test (Duration) Before anaesthesia EPT ( VAS) before anaesthesia EPT current pass EPT (Duration) before anaesthesia Need Supliment
0 0 37 0 0 6 30 2 0 0 1 1 0 1 1 0 0 5 32 5 1
1 0 47 0 0 2 30 0 0 0 0 0 0 0 0 3 3 0 80 3 1
2 0 27 0 0 6 7 0 0 0 0 0 0 0 0 7 23 5 27 19 1
3 1 27 0 0 6 7 0 0 0 0 0 0 0 0 7 27 5 21 37 1
4 1 23 0 0 4 60 1 0 0 0 1 0 1 0 5 12 2 43 5 1
In [54]:
clean_df = df.copy()
cat_features = clean_df.select_dtypes('object').columns
clean_df = pd.concat([clean_df.drop(cat_features, axis = 1), 
                                pd.get_dummies(clean_df[cat_features])], axis = 1)
clean_df.head()
Out[54]:
Patient Age Dental History Pain (VAS) Pain ( Duration) days Percussion Palpation Mobility PDL involvement Curved Canal ... Medical History_0 Medical History_CARD Medical History_DM Medical History_DM, HTN Medical History_DM, HTN, CAD Medical History_HT0 Medical History_HT0, DM Medical History_HTN Medical History_HTN, CARD Medical History_HTN, DM
0 0 37 0 6 30 2 0 0 1 1 ... 1 0 0 0 0 0 0 0 0 0
1 0 47 0 2 30 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
2 0 27 0 6 7 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
3 1 27 0 6 7 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 1 23 0 4 60 1 0 0 0 1 ... 1 0 0 0 0 0 0 0 0 0

5 rows × 29 columns

Features Importance

In [55]:
df2 = pd.DataFrame(clean_df.corrwith(df['Need Supliment']).sort_values(ascending=False))
df2 = df2.set_axis(['Correlation Coefficient'], axis=1).reset_index().rename(columns={'index': 'Features'})
df2
Out[55]:
Features Correlation Coefficient
0 Need Supliment 1.000000
1 Pulp stone or and Calcification 0.422838
2 EPT (Duration) before anaesthesia 0.289927
3 Pain ( Duration) days 0.285700
4 Age 0.273753
5 EPT ( VAS) before anaesthesia 0.251071
6 Cold test (Duration) Before anaesthesia 0.247909
7 EPT current pass 0.239994
8 Cold test ( VAS) Before anaesthesia 0.231929
9 Palpation 0.191575
10 Dental History 0.191014
11 Mobility 0.174159
12 Medical History_DM 0.171354
13 Pain (VAS) 0.161199
14 Percussion 0.159934
15 Medical History_DM, HTN 0.129024
16 Medical History_HTN, CARD 0.129024
17 Medical History_HTN 0.099893
18 Curved Canal 0.094392
19 Medical History_DM, HTN, CAD 0.063740
20 Medical History_HTN, DM 0.063740
21 Patient 0.033549
22 PDL space -0.003579
23 Lamina Dura -0.004864
24 PDL involvement -0.041200
25 Medical History_CARD -0.043146
26 Medical History_HT0 -0.124515
27 Medical History_HT0, DM -0.124515
28 Medical History_0 -0.239033
In [58]:
transformed_df = clean_df[df2[df2['Correlation Coefficient']>= 0.15]['Features'].to_list()]
transformed_df.head()
Out[58]:
Need Supliment Pulp stone or and Calcification EPT (Duration) before anaesthesia Pain ( Duration) days Age EPT ( VAS) before anaesthesia Cold test (Duration) Before anaesthesia EPT current pass Cold test ( VAS) Before anaesthesia Palpation Dental History Mobility Medical History_DM Pain (VAS) Percussion
0 1 0 5 30 37 5 0 32 0 0 0 0 0 6 2
1 1 0 3 30 47 0 3 80 3 0 0 0 0 2 0
2 1 0 19 7 27 5 23 27 7 0 0 0 0 6 0
3 1 0 37 7 27 5 27 21 7 0 0 0 0 6 0
4 1 0 5 60 23 2 12 43 5 0 0 0 0 4 1

Model Training

Train/Test Split

In [59]:
# Import the scikit-learn function used to split the data set
from sklearn.model_selection import train_test_split

# Designate the target feature as "y" and the explanatory features as "x"
y = transformed_df['Need Supliment']
x = transformed_df.drop('Need Supliment', axis=1)

# Create the train and test sets for x and y and specify a random_state seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 87)

Cross-Validation

GridSearch cross-validation for the logistic regression model is performed below.

In [62]:
# Import the scikit-learn functions and classes necessary to perform cross-validation
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV

# Import the functions used to save and load a trained model
from joblib import dump, load

# Import the scikit-learn class used to train a logistic regression model
from sklearn.linear_model import LogisticRegression

# Create a pipeline specifying all of the operations to perform when training the model
# In this case, the pipeline consists of z-score standardization and fitting of a logistic regression model
pipeline_lr = make_pipeline(preprocessing.StandardScaler(), LogisticRegression(max_iter = 150))

# Specify the hyperparameters and their corresponding values that are to be used in GridSearch
hyperparameters_lr = { 'logisticregression__C' : [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1] }

# Initialize the GridSearch cross-validation object, specifying 10 folds for 10-fold cross-validation and
# "f1" and "accuracy" as the evaluation metrics for cross-validation scoring
logistic_regression = GridSearchCV(pipeline_lr, hyperparameters_lr, cv = 10, scoring = ['f1', 'accuracy'], 
                                   refit = 'f1', verbose = 0, n_jobs = -1)

# Train and cross-validate the logistic regression model and ignore the function output
_ = logistic_regression.fit(x_train, y_train)

# Save the model so it can be used again without retraining it
_ = dump(logistic_regression, 'logistic_regression.joblib')

GridSearch cross-validation for the KNN model is performed below.

In [63]:
# Import the scikit-learn class used to implement a KNN classifier
from sklearn.neighbors import KNeighborsClassifier

# Create a pipeline specifying all of the operations to perform when training the model
# In this case, the pipeline consists of z-score standardization and initialization of a KNN classifier
pipeline_knn = make_pipeline(preprocessing.StandardScaler(), KNeighborsClassifier(algorithm = 'ball_tree'))

# Specify the hyperparameters and their corresponding values that are to be used in GridSearch
hyperparameters_knn = { 'kneighborsclassifier__n_neighbors' : [3, 5] }

# Initialize the GridSearch cross-validation object, specifying 5 folds for 5-fold cross-validation and
# "f1" and "accuracy" as the evaluation metrics for cross-validation scoring
knn = GridSearchCV(pipeline_knn, hyperparameters_knn, cv = 5, scoring = ['f1', 'accuracy'], 
                   refit = 'f1', verbose = 0, n_jobs = -1)

# Cross-validate the KNN model and ignore the function output
_ = knn.fit(x_train, y_train)

# Save the model so it can be used again without redefining it
_ = dump(knn, 'knn.joblib')

Model Evaluation

Performance on Train and Test Sets

Having trained and cross-validated the models, I then used the models to make predictions on the test set. I evaluated the performance of the models on the test set using the same F1 and accuracy metrics used to evaluate the models during cross-validation. The performance of the models as indicated by these metrics is displayed below.

In [64]:
# Import the scikit-learn functions used to calculate the F1 score and accuracy on the test set
from sklearn.metrics import f1_score, accuracy_score

# Use the best logistic regression model to make predictions on the test set
y_test_pred_lr = logistic_regression.predict(x_test)

# Display the F1 and accuracy on the train and test sets for the logistic regression model
print('Logistic regression F1 (train):', 
      round(logistic_regression.cv_results_['mean_test_f1'][logistic_regression.best_index_], 3))
print('Logistic regression F1 (test):', round(f1_score(y_test, y_test_pred_lr), 3), '\n')
print('Logistic regression accuracy (train):', 
      round(logistic_regression.cv_results_['mean_test_accuracy'][logistic_regression.best_index_], 3))
print('Logistic regression accuracy (test):', 
      round(accuracy_score(y_test, y_test_pred_lr), 3), '\n')

# Use the best KNN model to make predictions on the test set
y_test_pred_knn = knn.predict(x_test)

# Display the F1 and accuracy on the train and test sets for the KNN model
print('KNN F1 (train):', 
      round(knn.cv_results_['mean_test_f1'][knn.best_index_], 3))
print('KNN F1 (test):', round(f1_score(y_test, y_test_pred_knn), 3), '\n')
print('KNN accuracy (train):', 
      round(knn.cv_results_['mean_test_accuracy'][knn.best_index_], 3))
print('KNN accuracy (test):', 
      round(accuracy_score(y_test, y_test_pred_knn), 3), '\n')
Logistic regression F1 (train): 0.761
Logistic regression F1 (test): 0.895 

Logistic regression accuracy (train): 0.703
Logistic regression accuracy (test): 0.846 

KNN F1 (train): 0.759
KNN F1 (test): 0.865 

KNN accuracy (train): 0.712
KNN accuracy (test): 0.808 

Evaluating Bias vs Variance

To objectively determine the degree of bias and variance exhibited by the models, I used the guidelines presented below.

Bias:

  • High bias: F1 < 0.70
  • Medium bias: 0.70 <= F1 < 0.90
  • Low bias: 0.90 <= F1
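The guidelines above can be expressed as a small helper, sketched below; `bias_level` is an illustrative function name, applied here to the cross-validated (train) F1 scores reported earlier.

```python
def bias_level(f1):
    """Map a cross-validated F1 score to the bias category defined above."""
    if f1 < 0.70:
        return 'High bias'
    if f1 < 0.90:
        return 'Medium bias'
    return 'Low bias'

# Applied to the train (cross-validation) F1 scores reported earlier
lr_bias = bias_level(0.761)   # logistic regression
knn_bias = bias_level(0.759)  # KNN
```

By these thresholds, both models exhibit medium bias on their cross-validation F1 scores.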